SHUFFLING CARDS AND STOPPING TIMES


DAVID ALDOUS* 

Department of Statistics, University of California, Berkeley, CA 94720 


PERSI DIACONIS** 

Department of Statistics, Stanford University, Stanford, CA 94305 

1. Introduction. How many times must a deck of cards be shuffled until it is close to random?
There is an elementary technique which often yields sharp estimates in such problems. The 
method is best understood through a simple example. 

Example 1. Top in at random shuffle. Consider the following method of mixing a deck of 
cards: the top card is removed and inserted into the deck at a random position. This procedure is 
repeated a number of times. The following argument should convince the reader that about 
n log n shuffles suffice to mix up n cards. The argument depends on following the bottom card of 
the deck. This card stays at the bottom until the first time ($T_1$) a card is inserted below it.
Standard calculations, reviewed below, imply this takes about n shuffles. As the shuffles continue, 
eventually a second card is inserted below the original bottom card (this takes about n/2 further 
shuffles). Consider the instant ($T_2$) that a second card is inserted below the original bottom card.
The two cards under the original bottom card are equally likely to be in relative order low-high or 
high-low. 

Similarly, the first time a third card is inserted below the original bottom card, each of the 6 
possible orders of the 3 bottom cards is equally likely. Now consider the first time $T_{n-1}$ that the
original bottom card comes up to the top. By an inductive argument, all (n - 1)! arrangements of 
the lower cards are equally likely. When the original bottom card is inserted at random, at time 
$T = T_{n-1} + 1$, then all $n!$ possible arrangements of the deck are equally likely.

Fig. 1. Example of repeated top in at random shuffles of a 4-card deck.


When the original bottom card is at position k from the bottom, the waiting time for a new
card to be inserted below it is about n/k. Thus the waiting time T for the bottom card to come to
the top and be inserted is about

$$n + \frac{n}{2} + \frac{n}{3} + \cdots + \frac{n}{n} \approx n \log n.$$
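The heuristic can be checked by simulation. The sketch below is a minimal Monte Carlo experiment, not part of the original argument: the function names, the deck size n = 10, the number of trials and the seed are all illustrative choices. It follows the original bottom card and records the time T at which that card, having reached the top, is inserted at random.

```python
import math
import random

def time_to_randomize(n, rng):
    """Run top-in-at-random shuffles on an n-card deck and return T.

    T is the first time the original bottom card has risen to the top
    and has then been inserted at a random position.
    """
    deck = list(range(n))            # deck[0] is the top card, deck[-1] the bottom
    bottom = deck[-1]
    steps = 0
    while True:
        top_was_bottom = deck[0] == bottom
        card = deck.pop(0)
        deck.insert(rng.randrange(n), card)   # n equally likely slots
        steps += 1
        if top_was_bottom:
            return steps

def harmonic_estimate(n):
    """The sum n + n/2 + ... + n/n from the heuristic argument."""
    return n * sum(1.0 / k for k in range(1, n + 1))

rng = random.Random(0)               # fixed seed, chosen only for repeatability
n, trials = 10, 20000
mean_T = sum(time_to_randomize(n, rng) for _ in range(trials)) / trials
print(mean_T, harmonic_estimate(n), n * math.log(n))
```

The sample mean should sit close to the harmonic sum; n log n is the leading term of that sum and is noticeably smaller for a deck this small.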

This paper presents a rigorous version of the argument and illustrates its use in a variety of 
random walk problems. The next section introduces the basic mathematical set up. Section 3 
details a number of examples drawn from applications such as computer generated pseudo 
random numbers. Section 4 treats ordinary riffle shuffling, analyzing a model introduced by 
Gilbert, Shannon, and Reeds. Section 5 explains a sense in which the method of stopping times 
always works and compares this to two other techniques (Fourier analysis and coupling). Some 
open problems are listed. 

2. The Basic Set-Up. Repeated shuffling is best treated as random walk on the permutation
group $S_n$. For later applications, we treat an arbitrary finite group G. Given some scheme for
randomly picking elements of G, let Q(g) be the probability that g is picked. The numbers
$\{Q(g) : g \in G\}$ are a (probability) distribution: $Q(g) \ge 0$ and $\sum_g Q(g) = 1$. Repeated
independent picks according to the same scheme yield random elements $\xi_1, \xi_2, \xi_3, \ldots$ of G. Define the
products

$$X_0 = \text{identity}, \qquad X_1 = \xi_1, \qquad X_k = \xi_k X_{k-1} = \xi_k \xi_{k-1} \cdots \xi_1.$$

The random variables $X_0, X_1, X_2, \ldots$ are the random walk on G with step distribution Q. Think
of $X_k$ as the position at time k of a randomly-moving particle. The distribution of $X_2$, that is the
set of probabilities $P(X_2 = g)$, $g \in G$, is given by convolution

$$P(X_2 = g) = Q * Q(g) = \sum_{h \in G} Q(h)\, Q(g h^{-1}).$$

For $Q(h) Q(g h^{-1})$ is the chance that element h was picked first and $g h^{-1}$ was picked second; for
any h, this makes the product equal to g. Similarly, $P(X_k = g) = Q^{k*}(g)$, where $Q^{k*}$ is the
repeated convolution

(2.1)  $$Q^{k*}(g) = Q * Q^{(k-1)*}(g) = \sum_{h \in G} Q(h)\, Q^{(k-1)*}(g h^{-1}).$$

In modelling shuffling of an n-card deck, the state of the deck is represented as a permutation
$\pi \in S_n$, meaning that the card originally at position i is now at position $\pi(i)$.

In Example 1, $G = S_n$, and using cycle notation for permutations $\pi$,

$$Q(i, i-1, \ldots, 1) = 1/n, \quad 1 \le i \le n,$$

$$Q(\pi) = 0, \quad \text{else}.$$

Here $\xi_k$ is a randomly chosen cycle, $X_k$ is the state of the deck after k shuffles, and $Q^{k*}(\pi)$ is the
chance that the state after k shuffles is $\pi$. In Fig. 1, $\xi_1 = (3,2,1)$, $\xi_2 = (3,2,1)$ and
$X_2 = \xi_2 \xi_1 = (1,2,3)$.

We shall study the distribution $Q^{k*}$. Note that $Q^{k*}$ can be defined by (2.1) without using the
richer structure of the random walk $(X_k)$; however, this richer structure is essential for our
method of analysis.
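The convolution (2.1) is easy to carry out exactly for small decks. The sketch below is an illustration under assumed conventions: permutations are stored as 0-based tuples `p` with `p[i]` the new position of the card originally at position `i`, and the helper names are hypothetical. It builds the step distribution Q of Example 1, forms repeated convolutions, and reproduces the product of two copies of the cycle (3,2,1).

```python
from fractions import Fraction

def compose(a, b):
    """Return the permutation 'b first, then a' on 0-based position tuples."""
    return tuple(a[b[i]] for i in range(len(b)))

def top_in_at_random(n):
    """Step distribution Q of Example 1: the top card moves to slot i w.p. 1/n."""
    Q = {}
    for i in range(n):
        # position 0 goes to i; positions 1..i move up by one; the rest stay put
        perm = tuple(i if p == 0 else (p - 1 if p <= i else p) for p in range(n))
        Q[perm] = Fraction(1, n)
    return Q

def convolve(first, second):
    """Law of (second step) composed with (first step), the steps being independent."""
    out = {}
    for h, ph in first.items():
        for s, ps in second.items():
            g = compose(s, h)
            out[g] = out.get(g, 0) + ph * ps
    return out

def convolution_power(Q, k, n):
    """k-fold convolution of Q, starting from the point mass at the identity."""
    Qk = {tuple(range(n)): Fraction(1)}
    for _ in range(k):
        Qk = convolve(Qk, Q)
    return Qk

cycle = (2, 0, 1, 3)                 # the cycle (3,2,1) written 0-based on 4 cards
X2 = compose(cycle, cycle)           # 0-based form of the cycle (1,2,3)
Q2 = convolution_power(top_in_at_random(3), 2, 3)
print(X2, Q2)
```

Exact rational arithmetic is used so that sums of probabilities can be compared with 1 without rounding error; this is a convenience, not something the text requires.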

A fundamental result is that repeated convolutions converge to the uniform distribution U:

(2.2)  $$Q^{k*}(g) \to U(g) = 1/|G| \quad \text{as } k \to \infty,$$

unless Q is concentrated on some coset of some subgroup. This was first proved by Markov
(1906) (see Feller (1968), Section 15.10 for a clear discussion) and can nowadays be regarded as
a special case of the basic limit theory of finite Markov chains. Poincaré (1912) gave a Fourier
analytic proof, and subsequent workers have extended (2.2) to general compact groups; see
Grenander (1963), Heyer (1977), Diaconis (1982) for surveys. A version of this result is given here
as Theorem 3 of Section 3.

Despite this work on abstracting the asymptotic result (2.2), little attention has been paid until 
recently to the kind of non-asymptotic questions which are the subject of this paper. 

A natural way to measure the difference between two probability distributions $Q_1$, $Q_2$ on G is
by variation distance

$$\|Q_1 - Q_2\| = \frac{1}{2} \sum_{g \in G} |Q_1(g) - Q_2(g)|.$$

There are equivalent definitions

(2.3)  $$\|Q_1 - Q_2\| = \max_{A \subset G} |Q_1(A) - Q_2(A)| = \frac{1}{2} \max_{\|f\| = 1} |Q_1(f) - Q_2(f)|,$$

where $Q(A) = \sum_{g \in A} Q(g)$, $Q(f) = \sum_g f(g) Q(g)$, and $\|f\| = \max_g |f(g)|$. The string of equalities is
proved by noting that the maxima occur for $A = \{g : Q_1(g) > Q_2(g)\}$ and for $f = 1_A - 1_{A^c}$.
Thus, two distributions are close in variation distance if and only if they are uniformly close on all
subsets. Plainly $0 \le \|Q_1 - Q_2\| \le 1$.

An example may be useful. Suppose, after well-shuffling a deck of n cards, that you happen to
see the bottom card, c. Then your distribution Q on $S_n$ is uniform on the set of permutations $\pi$
for which $\pi(c) = n$, and $\|Q - U\| = 1 - 1/n$. This shows the variation distance can be very
“unforgiving” of small deviations from uniformity.
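The value 1 - 1/n in this example can be confirmed by direct enumeration for a small deck. The following sketch assumes distributions are stored as dictionaries keyed by 0-based permutation tuples; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

def variation_distance(Q1, Q2):
    """Half the sum of |Q1(g) - Q2(g)| over all g carrying mass under either law."""
    support = set(Q1) | set(Q2)
    return sum(abs(Q1.get(g, 0) - Q2.get(g, 0)) for g in support) / 2

def bottom_card_example(n, c=0):
    """Distance from uniform once card c is known to sit at the bottom."""
    group = list(permutations(range(n)))
    U = {g: Fraction(1, len(group)) for g in group}
    seen = [g for g in group if g[c] == n - 1]     # pi(c) = n in 1-based terms
    Q = {g: Fraction(1, len(seen)) for g in seen}
    return variation_distance(Q, U)

print(bottom_card_example(4), bottom_card_example(5))
```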

Given a distribution Q on a group G, (2.2) says

(2.4)  $$d_Q(k) \stackrel{\text{def}}{=} \|Q^{k*} - U\| \to 0 \quad \text{as } k \to \infty.$$

Where Q models a random shuffle, d(k) measures how close k repeated shuffles get the deck
to being perfectly (uniformly) shuffled. One might suppose d(k) decreases smoothly from (near) 1
to 0; and it is not hard to show d(k) is decreasing. However,

Theorem 1. For the “top in at random” shuffle, Example 1,

(a) $d(n \log n + cn) \le e^{-c}$; all $c \ge 0$, $n \ge 2$.

(b) $d(n \log n - c_n n) \to 1$ as $n \to \infty$; all $c_n \to \infty$.

This gives a sharp sense to the assertion that n log n shuffles are enough. This is a particular
case of a general cut-off phenomenon, which occurs in all shuffling models we have been able to
analyze; there is a critical number $k_0$ of shuffles such that $d(k_0 + o(k_0)) \approx 0$ but
$d(k_0 - o(k_0)) \approx 1$. (See Fig. 2.)



Fig. 2

Our aim is to determine $k_0$ in particular cases. This is quite different from looking at asymptotics
in (2.4): it is elementary that $d(k) \to 0$ geometrically fast, and Perron-Frobenius theory says
$d(k) \sim a\lambda^k$, where a, $\lambda$ have eigenvalue/eigenvector interpretations, but these asymptotics miss
the cut-off phenomenon. For card players, the question is not “exactly how close to uniform is the
deck after a million riffle shuffles?”, but “is 7 shuffles enough?”.

The main purpose of this paper is to show how upper bounds on d(k), like (a) in Theorem 1,
can be obtained using the notion of strong uniform times, which we now define in two steps.

Definition 1. Let G be a finite group, and $G^\infty$ the set of all G-valued infinite sequences
$g = (g_1, g_2, \ldots)$. A stopping rule $\hat{T}$ is a function $\hat{T}: G^\infty \to \{1, 2, 3, \ldots; \infty\}$ such that if
$\hat{T}(g) = j$, then $\hat{T}(\hat{g}) = j$ for all $\hat{g}$ with $\hat{g}_i = g_i$, $i \le j$.

Definition 2. Let Q be a distribution on G, and let $(X_k)$ be the associated random walk.
Given a stopping rule $\hat{T}$, the random time $T = \hat{T}(X_1, X_2, \ldots)$ is a stopping time. Call T a strong
uniform time (for Q) if for each $k < \infty$

(a) $P(T = k, X_k = g)$ does not depend on g.

Remark (i). Note that (a) is equivalent to

(b) $P(X_k = g \mid T = k) = 1/|G|$; $g \in G$,

and to

(c) $P(X_k = g \mid T \le k) = 1/|G|$; $g \in G$.

Remark (ii). Picture the process of picking group elements and multiplying. A stopping time is 
a rule which tells you when to “stop” with the current value of the product. The time is strong 
uniform if, conditional on stopping after k steps, the value of the product is uniform on G. 

Remark (iii). In Example 1, we defined a time T as the first time that the original bottom card 
has come to the top and been inserted into the deck. This is certainly a stopping time, and the 
inductive argument in Section 1 shows that, given T = k, all arrangements of the deck are equally 
likely. 

Remark (iv). In practice it is often useful to have a slightly more general notion of stopping
time, which allows the decision on whether or not to stop at n to depend not only on $(X_1, \ldots, X_n)$
but also on the value of some random quantity Y independent of the X process. Such a time T is
called a randomized stopping time; our results extend to this case without essential change.

Here is a basic upper bound lemma which relates strong uniform times to the distance between 
Q k * and the uniform distribution. 

Lemma 1. Let Q be a probability distribution on a finite group G. Let T be a strong uniform time
for Q. Then

$$d(k) \equiv \|Q^{k*} - U\| \le P(T > k), \quad \text{all } k \ge 0.$$

Proof. For any $A \subset G$,

$$\begin{aligned}
Q^{k*}(A) &= P(X_k \in A) \\
&= \sum_{j \le k} P(X_k \in A, T = j) + P(X_k \in A, T > k) \\
&= \sum_{j \le k} U(A) P(T = j) + P(X_k \in A \mid T > k) P(T > k) \\
&= U(A) + \{P(X_k \in A \mid T > k) - U(A)\} P(T > k)
\end{aligned}$$

and so

$$|Q^{k*}(A) - U(A)| \le P(T > k). \qquad \square$$

We conclude this section by using Lemma 1 and elementary probability concepts to prove
Theorem 1. Here is one elementary result we shall use in several examples.

Lemma 2. Sample uniformly with replacement from an urn with n balls. Let V be the number of
draws required until each ball has been drawn at least once. Then

$$P(V > n \log n + cn) \le e^{-c}, \quad c \ge 0, \; n \ge 1.$$

Proof. Let m = n log n + cn. For each ball b let $A_b$ be the event “ball b not drawn in the
first m draws”. Then

$$P(V > m) = P\Bigl(\bigcup_b A_b\Bigr) \le \sum_b P(A_b) = n\Bigl(1 - \frac{1}{n}\Bigr)^m \le n \exp(-m/n) = e^{-c}. \qquad \square$$

Remark. This is the famous “coupon-collector’s problem”, discussed in Feller (1968). The
asymptotics are $P(V > n \log n + cn) \to 1 - \exp(-e^{-c})$ as $n \to \infty$, c fixed. So for c not small
the bound in Lemma 2 is close to sharp.
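To see how tight the union bound is, the exact tail of V can be computed by inclusion-exclusion over the set of balls not yet drawn. The sketch below is illustrative: the function names are hypothetical, and rounding m up to an integer is a choice made here so that the number of draws is a whole number (it can only make the event rarer).

```python
import math
from fractions import Fraction

def coupon_tail(n, m):
    """Exact P(V > m): some ball is still missing after m draws."""
    return sum(
        (-1) ** (j + 1) * math.comb(n, j) * Fraction(n - j, n) ** m
        for j in range(1, n + 1)
    )

def lemma2_check(n, c):
    """Return (exact tail, bound e^{-c}) at m = ceil(n log n + c n)."""
    m = math.ceil(n * math.log(n) + c * n)
    return float(coupon_tail(n, m)), math.exp(-c)

for c in (0, 1, 2, 3):
    print(c, lemma2_check(52, c))
```

For a 52-ball urn the exact tail stays below the bound at each of these values of c, in line with the lemma.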


Proof of Theorem 1. Recall we have argued that T, the first time that the original bottom card
has come to the top and been inserted into the deck, is a strong uniform time for this shuffling
scheme. We shall prove that T has the same distribution as V in Lemma 2; then assertion (a) is a
consequence of Lemmas 1 and 2.

We can write

(2.5)  $$T = T_1 + (T_2 - T_1) + \cdots + (T_{n-1} - T_{n-2}) + (T - T_{n-1}),$$

where $T_i$ is the time until the i th card is placed under the original bottom card (with $T_0 = 0$
and $T_n = T$). When exactly i cards are under the original bottom card b, the chance that the current
top card is inserted below b is (i + 1)/n, and hence the random variable $T_{i+1} - T_i$ has geometric
distribution

(2.6)  $$P(T_{i+1} - T_i = j) = \frac{i+1}{n}\Bigl(1 - \frac{i+1}{n}\Bigr)^{j-1}; \quad j \ge 1.$$

The random variable V in Lemma 2 can be written as

(2.7)  $$V = (V - V_{n-1}) + (V_{n-1} - V_{n-2}) + \cdots + (V_2 - V_1) + V_1,$$

where $V_i$ is the number of draws required until i distinct balls have been drawn at least once
(with $V_0 = 0$ and $V_n = V$). After i distinct balls have been drawn, the chance that a draw produces
a not-previously-drawn ball is (n - i)/n. So $V_{i+1} - V_i$ has distribution

$$P(V_{i+1} - V_i = j) = \frac{n-i}{n}\Bigl(1 - \frac{n-i}{n}\Bigr)^{j-1}; \quad j \ge 1.$$

Comparing with (2.6), we see that corresponding terms $(T_{i+1} - T_i)$ and $(V_{n-i} - V_{n-i-1})$ have the
same distribution; since the summands within each of (2.5) and (2.7) are independent, it follows
that the sums T and V have the same distribution, as required.
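Part (a) can be checked exactly for very small decks by comparing d(k) with P(V > k). The sketch below makes two choices that are not in the text: a state is stored as the deck order (a tuple of cards from top to bottom) rather than as the permutation $\pi$, which does not change the distance to uniform, and the helper names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import permutations

def top_in_at_random_step(dist, n):
    """Push a law on deck orders through one top-in-at-random shuffle."""
    out = {}
    for deck, p in dist.items():
        rest = deck[1:]
        for slot in range(n):
            new = rest[:slot] + (deck[0],) + rest[slot:]
            out[new] = out.get(new, 0) + p / n
    return out

def distance_to_uniform(dist, n):
    """Variation distance between dist and the uniform law on n! orders."""
    u = Fraction(1, math.factorial(n))
    return sum(abs(dist.get(g, 0) - u) for g in permutations(range(n))) / 2

def coupon_tail(n, k):
    """Exact P(V > k) for the coupon collector with n balls."""
    return sum((-1) ** (j + 1) * math.comb(n, j) * Fraction(n - j, n) ** k
               for j in range(1, n + 1))

def compare(n, kmax):
    """Rows (k, d(k), P(V > k)) for k = 1..kmax."""
    dist = {tuple(range(n)): Fraction(1)}
    rows = []
    for k in range(1, kmax + 1):
        dist = top_in_at_random_step(dist, n)
        rows.append((k, distance_to_uniform(dist, n), coupon_tail(n, k)))
    return rows

for k, d, tail in compare(4, 12):
    print(k, float(d), float(tail))
```

The enumeration grows like n!, so this is only practical for a handful of cards; it is meant as a sanity check on the inequality, not as a way to study the cut-off.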

To prove (b), fix j and let $A_j$ be the set of configurations of the deck such that the bottom j
original cards remain in their original relative order. Plainly $U(A_j) = 1/j!$. Let k = k(n) be of the
form $n \log n - c_n n$, $c_n \to \infty$. We shall show

(2.8)  $$Q^{k*}(A_j) \to 1 \quad \text{as } n \to \infty; \; j \text{ fixed}.$$

Then $d(k) \ge \max_j \{Q^{k*}(A_j) - U(A_j)\} \to 1$ as $n \to \infty$, establishing part (b).

To prove (2.8), observe that $Q^{k*}(A_j) \ge P(T - T_{j-1} > k)$. For $T - T_{j-1}$ is distributed as the
time for the card initially j th from bottom to come to the top and be inserted; and if this has not
occurred by time k, then the original bottom j cards must still be in their original relative order at
time k. Thus it suffices to show

(2.9)  $$P(T - T_{j-1} \le k) \to 0 \quad \text{as } n \to \infty; \; j \text{ fixed}.$$

We shall prove this using Chebyshev’s inequality:

$$P(|Z - EZ| \ge a) \le \frac{\operatorname{var}(Z)}{a^2},$$ where a > 0, and Z is any random variable.

From (2.6),

$$E(T_{i+1} - T_i) = \frac{n}{i+1}, \qquad \operatorname{var}(T_{i+1} - T_i) = \Bigl(\frac{n}{i+1}\Bigr)^2 \Bigl(1 - \frac{i+1}{n}\Bigr),$$

and so from (2.5)

$$E(T - T_{j-1}) = \sum_{i=j-1}^{n-1} \frac{n}{i+1} = n \log n + O(n), \qquad \operatorname{var}(T - T_{j-1}) = \sum_{i=j-1}^{n-1} \Bigl(\frac{n}{i+1}\Bigr)^2 \Bigl(1 - \frac{i+1}{n}\Bigr) = O(n^2),$$

and Chebyshev’s inequality applied to $Z = T - T_{j-1}$ readily yields (2.9). □

Remark. Note that the “strong uniform time” property of T played no role in establishing the
lower bound (b). Essentially, we get lower bounds by guessing some set A for which
$|Q^{k*}(A) - U(A)|$ should be large, and using the obvious (from (2.3)) inequality

$$d(k) = \|Q^{k*} - U\| \ge |Q^{k*}(A) - U(A)|.$$


3. Examples. We present constructions of strong uniform times for a variety of random walks: 
simple random walk on the circle, general random walks on finite groups, and a random walk 
arising in random number generation. Sometimes our arguments give the optimal rate, often they 
give the correct order of magnitude. 

Example 2. Simple random walk on the integers mod n. Let n be a positive odd integer. Let $Z_n$
be the integers mod n, thought of as n points on a circle. Imagine a particle which moves by steps,
each step being equally likely to move 1 to the right or 1 to the left. This random walk has step
distribution Q on $Z_n$:

(3.1)  $$Q(1) = Q(-1) = \tfrac{1}{2}.$$

The following theorem shows that the number of steps k required to become uniform is slightly
more than $n^2$.

Theorem 2. Let $n \ge 3$ be an odd integer. For simple random walk on the integers mod n defined
by (3.1), for $k \ge n^2$,

$$d(k) \le 6 e^{-\alpha k/n^2}$$

with $\alpha = 4\pi^2/3$.

Proof . First consider n = 5 and the following 5 patterns 

H — I , H , — I — I — h , H — I — ^ , . 

A sequence of successive steps of the walk on $Z_5$ yields a sequence of ± symbols. Consider the
sequence in disjoint blocks of 4. Stop the first time T that a block of 4 equals one of the above 5
patterns. Thus, if the sequence starts + H F , H — I — I — , + H , T=12. 
This stopping time is clearly a strong uniform time; given that T = 12, all 5 final positions in
$Z_5$ are equally likely. Such sets of k-tuples can be chosen for any odd n. It turns out that to get
the correct rate of convergence, k should be chosen as a large multiple of $n^2$. Here are some
details.

For fixed integers n and k, with n odd, let $B_j$ be the set of binary k-tuples with j pluses
(mod n).

Let $j^*$ be the index corresponding to the smallest $|B_j|$. Partition the set of binary k-tuples
into n groups of size $|B_{j^*}|$, the j th group being chosen arbitrarily from $B_j$. The random walk
generates a sequence of ± symbols. Consider these in disjoint blocks of length k. Define T as the
first time a block equals one of the chosen group. This clearly yields a strong uniform time. The
following lemma gives an explicit upper bound for d(k).
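The block construction can be made concrete. The sketch below is an illustration only: it keeps the first $|B_{j^*}|$ blocks of each class in enumeration order, which is one arbitrary choice among many, and the function names are hypothetical. It then verifies that, given a block is one of the chosen ones, its net displacement mod n is uniform, which is what makes T a strong uniform time.

```python
from fractions import Fraction
from itertools import product

def build_stopping_blocks(n, k):
    """Pick the same number of +/- blocks of length k from every class B_j.

    B_j holds the blocks whose number of pluses is congruent to j mod n.
    """
    classes = {j: [] for j in range(n)}
    for block in product((+1, -1), repeat=k):
        classes[block.count(+1) % n].append(block)
    size = min(len(B) for B in classes.values())       # this is |B_{j*}|
    return [block for B in classes.values() for block in B[:size]]

def displacement_law(n, k):
    """Law of the net displacement mod n given the block is chosen, and P(chosen)."""
    chosen = build_stopping_blocks(n, k)
    law = {}
    for block in chosen:
        d = sum(block) % n
        law[d] = law.get(d, 0) + Fraction(1, len(chosen))
    return law, Fraction(len(chosen), 2 ** k)

law, p_stop = displacement_law(5, 4)
print(law, p_stop)
```

For n = 5 and k = 4 every class contributes a single block, so five of the sixteen blocks trigger a stop. Uniformity relies on n being odd, since then the number of pluses mod n determines the displacement mod n and vice versa.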


Lemma 3. Let T be as defined above. For $n \ge 3$ and $k \ge n^2$,

$$P(T > k) \le 6 e^{-\alpha k/n^2}$$

with $\alpha = 4\pi^2/3$.


Proof. The number of elements in $B_j$ is

$$|B_j| = \sum_{l \ge 0} \binom{k}{j + ln} = \frac{2^k}{n} \sum_{l=0}^{n-1} e^{-2\pi i j l/n} \Bigl(\frac{1 + e^{2\pi i l/n}}{2}\Bigr)^k,$$

this being a classical identity due to C. Ramus (see Knuth (1973, p. 70)). The chance of a given
block falling in the chosen group equals

$$p = \frac{n |B_{j^*}|}{2^k} = \sum_{l=0}^{n-1} e^{-2\pi i j^* l/n} \Bigl(\frac{1 + e^{2\pi i l/n}}{2}\Bigr)^k.$$

Now

$$P(T > k) = p(1-p) + p(1-p)^2 + p(1-p)^3 + \cdots = 1 - p, \qquad 1 - p \le \sum_{l=1}^{n-1} \Bigl|\cos\frac{\pi l}{n}\Bigr|^k.$$

Straightforward calculus using quadratic approximations to cosine such as $\cos x \le e^{-x^2/2}$
for $0 \le x \le \pi/2$ leads to the stated result. Further details may be found in Chung,
Diaconis, and Graham (1986). □


Remark. There is a lower bound for d(k) of the form $a e^{-\beta k/n^2}$ for positive a and $\beta$, so
somewhat more than $n^2$ steps really are required. One way to prove this is to use the central limit
theorem; this implies that after k steps the walk has moved a net distance of order $k^{1/2}$. Hence we
need k of order $n^2$ at least in order that the distribution after k steps is close to uniform. Further
details are in Chung, Diaconis and Graham (1986).

There is a sense in which the cutoff phenomenon does not occur for this example. It is possible
to show there is a continuous function $d^*(t)$, with $d^*(t) \to 0$ as $t \to \infty$, such that for simple
random walk on $Z_n$,

$$\max_k |d(k) - d^*(k/n^2)| \to 0 \quad \text{as } n \to \infty.$$

Indeed, as $n \to \infty$, a rescaled version of the random walk tends to Brownian motion on the circle.
The function $d^*(t)$ is the variation distance to uniformity for Brownian motion at time t.


Example 3. A bound for general problems. Let G be a finite group and Q a probability on G.
The following result shows that $Q^{*k}$ converges to the uniform distribution geometrically fast
provided Q is not concentrated on a subgroup or a translate of a subgroup. To see the need for
this condition, consider Example 2 above (simple random walk on $Z_n$). If n is even, then the
particle is at an even position after an even number of steps, so the distribution never converges to
uniform.

A simple way to force convergence is the following:

(3.2)  Suppose for some $k_0$ and $0 < c < 1$, $Q^{*k_0}(g) \ge c\,U(g)$ for all $g \in G$.

Theorem 3. Condition (3.2) implies

$$d(k) \le (1 - c)^{\lfloor k/k_0 \rfloor} \quad \text{for all } k \ge k_0.$$

Proof. The argument proceeds by constructing another process which behaves like the original
random walk but easily exhibits a strong uniform time. Suppose first that $k_0 = 1$, so $Q(g) \ge c\,U(g)$
for all g. Define

$$R(g) = [Q(g) - c\,U(g)]/[1 - c].$$

Observe that R(g) is a probability and

(3.3)  $$Q(g) = (1 - c) R(g) + c\,U(g).$$

Consider a new random walk defined as follows. For each step, flip a coin with probability of
heads c. If the coin comes up heads, take the step according to U(g). If the coin comes up tails,
take the step according to R(g). Because of (3.3), each step is taken according to Q overall. Let T
be the first time that the coin comes up heads. Then T is a (randomized) stopping time and
because the convolution of the uniform distribution with any distribution is uniform, T is a strong
uniform time.

Clearly,

$$P(T > k) = (1 - c)^k.$$

For $k_0 > 1$, apply the argument to the probability $Q^{*k_0}$. □
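Theorem 3 can be tried out on the simple random walk of Example 2. The sketch below is a numerical illustration under assumptions made here: n = 5 and $k_0 = 4$ (the first time every point of $Z_5$ has positive mass), with c taken as the largest constant satisfying (3.2) at that $k_0$. The function names are hypothetical.

```python
from fractions import Fraction

def convolve_mod_n(P, Q, n):
    """Convolution of two laws on Z_n, each given as a list indexed by residue."""
    out = [Fraction(0)] * n
    for x, px in enumerate(P):
        for s, qs in enumerate(Q):
            out[(x + s) % n] += px * qs
    return out

def distance_to_uniform(P):
    """Variation distance between P and the uniform law on Z_n."""
    n = len(P)
    return sum(abs(p - Fraction(1, n)) for p in P) / 2

def minorization_bound_table(n, k0, kmax):
    """Return c and rows (k, d(k), (1 - c)^floor(k/k0)) for k = k0..kmax."""
    Q = [Fraction(0)] * n
    Q[1] = Q[n - 1] = Fraction(1, 2)            # step distribution (3.1)
    powers = [None, Q]                          # powers[k] is the k-fold convolution
    for _ in range(2, kmax + 1):
        powers.append(convolve_mod_n(powers[-1], Q, n))
    c = n * min(powers[k0])                     # largest c with Q^{*k0} >= c U
    rows = [(k, distance_to_uniform(powers[k]), (1 - c) ** (k // k0))
            for k in range(k0, kmax + 1)]
    return c, rows

c, rows = minorization_bound_table(5, 4, 40)
for k, d, bound in rows[::6]:
    print(k, float(d), float(bound))
```

The bound holds at every k in this range but is loose, which matches the caution expressed about the theorem's practical value.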

Remark (i). The argument given is valid for a probability on a general compact group. In this 
form, Theorem 3 is due to Kloss (1959). The proof we give is very close to techniques exploited by 
Athreya and Ney (1977) for general state space Markov processes. 

Remark (ii). While Theorem 3 seems quantitative, the simplicity of the argument should make
one suspicious. The reader can see the difficulty by trying to get a rate of convergence for simple
random walk on $Z_n$. Estimating c and $k_0$ is not an easy problem; we do not know how to use
Theorem 3 to get the correct rate of convergence for any non-trivial problem.

Example 4. A random walk on $Z_n$ arising in random number generation. Random number
generators often work by recursively computing $X_{k+1} = aX_k + b$ (modulo n), where a, b and n
are chosen carefully; see Knuth (1981). Of course the sequence $X_k$ is really deterministic and
exhibits many regularities. To improve things, various schemes have been suggested involving
combining several generators. In one scheme, a and b are chosen each step from another
generator. If this second source is considered truly random (it may be the result of a physical
generator using a radioactive source) one may inquire how long it takes $X_k$ to become random.

For example, if a = 1 and b = 0, +1, or -1 each with probability 1/3, the process becomes
simple random walk on $Z_n$: $X_k = X_{k-1} + b_k$ (mod n) with a slightly different step size than
considered in Example 2. The argument given there can easily be adapted to show that slightly
more than $n^2$ steps are required to become random.

We now consider the effect of a deterministic doubling:

(3.4)  $$X_k = 2X_{k-1} + b_k \pmod{n}, \qquad b_k = 0, \pm 1 \text{ with probability } \tfrac{1}{3}.$$

We will show that this dramatically speeds things up: from $n^2$ down to log n log log n. The
argument is presented as a non-trivial illustration of the method of strong uniform times. It
involves a novel construction of an almost uniform time. For simplicity, we take $n = 2^l - 1$ (a
common choice in the application).
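The speed-up from doubling shows up already for small moduli. The sketch below evolves the exact distribution of $X_k$ for the doubling recurrence (3.4) and for the plain walk with a = 1, and compares their distances to uniform after the same number of steps. The starting point $X_0 = 0$, the choice l = 5 and the horizon of 30 steps are assumptions made for this illustration, as are the function names.

```python
def step_doubling(P, n):
    """One step of X -> 2X + b (mod n) with b uniform on {-1, 0, +1}."""
    out = [0.0] * n
    for x, p in enumerate(P):
        for b in (-1, 0, 1):
            out[(2 * x + b) % n] += p / 3
    return out

def step_walk(P, n):
    """One step of X -> X + b (mod n) with b uniform on {-1, 0, +1}."""
    out = [0.0] * n
    for x, p in enumerate(P):
        for b in (-1, 0, 1):
            out[(x + b) % n] += p / 3
    return out

def distance_to_uniform(P):
    """Variation distance between P and the uniform law on Z_n."""
    n = len(P)
    return sum(abs(p - 1.0 / n) for p in P) / 2

def run(step, n, k):
    """Distance to uniform after k steps, starting from X_0 = 0 (an assumption)."""
    P = [0.0] * n
    P[0] = 1.0
    for _ in range(k):
        P = step(P, n)
    return distance_to_uniform(P)

l = 5
n = 2 ** l - 1
d_doubling = run(step_doubling, n, 30)
d_walk = run(step_walk, n, 30)
print(d_doubling, d_walk)
```

After 30 steps on $Z_{31}$ the doubling chain is very close to uniform while the plain walk is still far from it.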

Theorem 4. Let $Q_k$ be the probability distribution of $X_k$ defined by (3.4) with $n = 2^l - 1$. Let
$d(k) = \|Q_k - U\|$. Then

$$d(c\, l \log l) \to 0 \quad \text{as } l \to \infty, \text{ for } c > \frac{1}{\log 3}.$$

Proof. Observe first that if $\delta_i$ takes values ±1 with probability 1/2, then

$$U^* = 2^{l-1}\delta_1 + 2^{l-2}\delta_2 + \cdots + \delta_l$$

is very close to uniformly distributed mod $2^l - 1$. Indeed,

$$P(U^* = j \pmod{2^l - 1}) = \begin{cases} 1/2^l, & j \ne 0, \\ 2/2^l, & j = 0. \end{cases}$$

Thus

(3.5)  $$\|U^* - U\| = \frac{1}{2^l}\Bigl(1 - \frac{1}{2^l - 1}\Bigr).$$

The argument proceeds by finding a stopping time T such that the process stopped at time T has
distribution at least as close to uniform as $U^*$. An appropriate modification of the upper bound
lemma will complete the proof. We isolate the steps as a sequence of lemmas. The first and second
lemmas are elementary with proofs omitted.


Lemma 4. Let $X_1, X_2, \ldots$ be a process with values in a finite group G. Write $Q_k$ for the
probability distribution of $X_k$. Let T be a stopping time with the property that for some $\varepsilon > 0$,

$$\|Q_k(\cdot \mid T = j) - U\| \le \varepsilon; \quad \text{all } j \le k.$$

Then

$$\|Q_k - U\| \le \varepsilon + P(T > k).$$

Lemma 5. Let $Q_1$ and $Q_2$ be probability distributions on a finite group G. Then

$$\|Q_1 * Q_2 - U\| \le \|Q_1 - U\|.$$

To state the third lemma, an appropriate stopping time T must be defined. Using the defining
recurrence $X_k = 2X_{k-1} + b_k$ (mod n),

(3.6)  $$X_k = 2^{k-1} b_1 + 2^{k-2} b_2 + \cdots + b_k \pmod{n}.$$

Since $n = 2^l - 1$, $2^l \equiv 1$ (mod n). Group the terms on the right side of (3.6) by distinct powers
of 2:

$$X_k = 2^{l-1} A_1 + 2^{l-2} A_2 + \cdots + A_l \pmod{n}$$

with

$$A_1 = b_1 + b_{l+1} + b_{2l+1} + \cdots, \quad A_2 = b_2 + b_{l+2} + \cdots, \quad \text{etc.}$$

Define T as the first time each of the sums $A_1, A_2, \ldots, A_l$ contains at least one non-zero
summand.

Lemma 6. The probability distribution of $X_k$ given $T = j \le k$ is the convolution of $U^*$ defined
above with an independent random variable.





Proof. Let $\delta_i^*$ be the first non-zero summand in $A_i$. Write

$$X_k = [2^{l-1}\delta_1^* + 2^{l-2}\delta_2^* + \cdots + \delta_l^*] + [2^{l-1}(A_1 - \delta_1^*) + \cdots + (A_l - \delta_l^*)].$$

Clearly the first term on the right has distribution $U^*$. Further, given all the remaining values of
$b_k$, and the labels of the $\delta_i^*$, all $2^l$ values of $\delta_1^*, \ldots, \delta_l^*$ are equally likely, so the
decomposition of $X_k$ is into independent parts. □

Using Lemmas 5, 6, along with the bound (3.5) allows us to take $\varepsilon = 2/2^l$ in Lemma 4 for this
stopping time T. To complete the proof of Theorem 4, it only remains to estimate P(T > k).
Toward this end, consider k = al for integer a,

$$P(T > al) = 1 - \Bigl[1 - \Bigl(\frac{1}{3}\Bigr)^a\Bigr]^l.$$

For large l, this is approximately $1 - \exp\{-l e^{-a \log 3}\}$. If $a = \frac{\log l + c}{\log 3}$ for some value of c, this
becomes $1 - \exp\{-e^{-c}\}$ which is well approximated by $e^{-c}$ for large c. It follows that for c
large, $\frac{l \log l}{\log 3} + cl$ steps suffice to be close to uniform. This is more than was claimed in Theorem 4. □

Remark. Chung, Diaconis and Graham (1986) give a more detailed analysis, showing that
$l \log l$ is the correct order of magnitude.

4. An Analysis of Riffle Shuffles. In this section we analyze the most commonly used method 
of shuffling cards — the ordinary riffle shuffle. This involves cutting the deck approximately in 
half, and interleaving (or riffling) the two halves together. We begin by introducing a mathemati- 
cal model for shuffling suggested by Gilbert, Shannon and Reeds. Following Reeds, we introduce 
a strong uniform time for this model and show how the calculations reduce to simple facts about 
the birthday problem. 

The diagram gives the result of a single riffle shuffle of a 10 card deck in the usual $i \to \pi(i)$
format

     i    label at position i    label at position i    π(i)
          before the shuffle     after the shuffle
     1            0                      1                2
     2            0                      0                4
     3            0                      1                5
     4            0                      0                7
     5            1                      0                1
     6            1                      1                3
     7            1                      0                6
     8            1                      1                8
     9            1                      1                9
    10            1                      1               10


This shuffle is the result of cutting 4 cards off the top of a 10 card “deck” and riffling the packets 
together, first dropping cards 10, 9, 8, then card 4, then 7, and so on. 

This permutation has two rising sequences

$$\pi(1) < \pi(2) < \pi(3) < \pi(4) \quad \text{and} \quad \pi(5) < \pi(6) < \pi(7) < \pi(8) < \pi(9) < \pi(10).$$

In general, a permutation $\pi$ of n cards made by a riffle shuffle will have exactly 2 rising sequences
(unless it is the identity, which has 1). Conversely, any permutation of n cards with 1 or 2 rising
sequences can be obtained by a physical riffle. Thus the mathematical definition of a riffle shuffle
is “a permutation with 1 or 2 rising sequences”. Suppose c cards are initially cut off the top. Then
there are $\binom{n}{c}$ possible riffle shuffles (1 of which is the identity). To see why, label each of the c
cards cut with “0” and the others with “1”. After the shuffle, the labels form a binary n-tuple with
c “0”s: there are $\binom{n}{c}$ such n-tuples and each corresponds to a unique riffle shuffle. Finally, the
total number of possible riffle shuffles is

$$1 + \sum_{c=0}^{n} \Bigl\{\binom{n}{c} - 1\Bigr\} = 2^n - n.$$
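The count $2^n - n$ can be confirmed by brute force for small n. The sketch below takes a riffle shuffle to be a permutation $\pi$ that is increasing on positions 1..c and on positions c+1..n for some cut c, as in the 10 card example; this is the same as saying the sequence $\pi(1), \ldots, \pi(n)$ has at most one descent. The function names are illustrative.

```python
from itertools import permutations

def is_riffle(pi):
    """True if pi(1), ..., pi(n) has at most one descent."""
    descents = sum(1 for a, b in zip(pi, pi[1:]) if a > b)
    return descents <= 1

def count_riffles(n):
    """Number of permutations of n cards that a single riffle can produce."""
    return sum(1 for pi in permutations(range(1, n + 1)) if is_riffle(pi))

for n in range(1, 8):
    print(n, count_riffles(n), 2 ** n - n)
```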

Some stage magicians can perform “perfect” shuffles, but for most of us the result of a shuffle
is somewhat random. The actual distribution of one shuffle (that is, the set of probabilities of each
of the $2^n - n$ possible riffle shuffles) will depend on the skill of the individual shuffler. The
following model for random riffle shuffle, suggested by Gilbert and Shannon (1955) and Reeds
(1981), is mathematically tractable and qualitatively similar to shuffles done by amateur card
players.

1st description. Begin by choosing an integer c from 0, 1, ..., n according to the binomial
distribution $P(C = c) = \frac{1}{2^n}\binom{n}{c}$. Then, c cards are cut off and held in the left hand, and n - c
cards are held in the right hand. The cards are dropped from a given hand with probability
proportional to packet size. Thus, the chance that a card is first dropped from the left hand packet
is c/n. If this happens, the chance that the next card is dropped from the left packet is
(c - 1)/(n - 1).

There are two other descriptions of this shuffling mechanism that are useful.

2nd description. Cut an n card deck according to a binomial distribution. If c cards are cut off,
pick one of the $\binom{n}{c}$ possible shuffles uniformly.

3rd description. This generates $\pi^{-1}$ with the correct probability. Label the back of each card
with the result of an independent, fair coin flip as 0 or 1. Remove all cards labelled 0 and place
them on top of the deck, keeping them in the same relative order.

Lemma 7. The three descriptions yield the same probability distribution. 

Proof. The second and third descriptions are equivalent. Indeed, the binary labelling chooses a 
binomially distributed number of zeros, and conditional on this choice, all possible placements of 
the zeros are equally likely. 

The first and second descriptions are equivalent. Suppose c cards have been cut off. For the
first description, a given shuffle is specified by a sequence $D_1, D_2, \ldots, D_n$, where each $D_i$ can be
L or R and c of the $D_i$'s must be L. Under the given model, the chance of all such sequences,
determined by multiplying the chance at each stage, is $c!\,(n - c)!/n!$. □
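The first and third descriptions translate directly into code. The sketch below is one possible rendering, with hypothetical function names and a deck stored as a list whose first entry is the top card. It also enumerates all $2^n$ labelings of the third description to confirm that the identity arises from n + 1 labelings and every other inverse shuffle from exactly one.

```python
import random
from collections import Counter
from itertools import product

def riffle_first_description(deck, rng):
    """One riffle following the 1st description; deck[0] is the top card."""
    n = len(deck)
    c = sum(rng.getrandbits(1) for _ in range(n))      # Binomial(n, 1/2) cut
    left, right = list(deck[:c]), list(deck[c:])
    pile = []                                          # built from the bottom up
    while left or right:
        if rng.random() < len(left) / (len(left) + len(right)):
            pile.append(left.pop())                    # drop from the left hand
        else:
            pile.append(right.pop())                   # drop from the right hand
    pile.reverse()
    return pile

def inverse_riffle(deck, bits):
    """3rd description: cards labelled 0 go on top, relative order kept."""
    zeros = [card for card, b in zip(deck, bits) if b == 0]
    ones = [card for card, b in zip(deck, bits) if b == 1]
    return zeros + ones

def inverse_riffle_law(n):
    """How many of the 2^n labelings produce each inverse shuffle."""
    deck = list(range(n))
    return Counter(tuple(inverse_riffle(deck, bits))
                   for bits in product((0, 1), repeat=n))

law = inverse_riffle_law(5)
shuffled = riffle_first_description(list(range(10)), random.Random(1))
print(len(law), shuffled)
```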

The argument to follow analyzes the repeated inverse shuffle. This has the same distance to
uniform as repeated shuffling because of the following lemma.

Lemma 8. Let G be a finite group, $T: G \to G$ one-to-one, and Q a probability on G. Then

$$\|Q - U\| = \|QT^{-1} - U\|,$$

where $QT^{-1}(g) = Q(T^{-1}(g))$ is the probability induced by T. □


The results of repeated inverse shuffles of n cards can be recorded by forming a binary matrix
with n rows. The first column records the zeros and ones that determine the first shuffle, and so
on. The i th row of the matrix is associated to the i th card in the original ordering of the deck,
recording in coordinate j the behavior of this card on the j th shuffle.

                      1          2          3          4

    a 1101         c 0010     c 0010     f 1000     f 1000
    b 1100         e 0110     d 1011     a 1101     b 1100
    c 0010         a 1101     f 1000     b 1100     c 0010
    d 1011         b 1100     e 0110     c 0010     e 0110
    e 0110         d 1011     a 1101     d 1011     a 1101
    f 1000         f 1000     b 1100     e 0110     d 1011


Lemma 9 (Reeds). Let T be the first time that the binary matrix formed from inverse shuffling has 
distinct rows. Then T is a strong uniform time. 

Proof. The matrix can be considered as formed by flipping a fair coin to fill out the i, j entry. 
At every stage, the rows are independent binary vectors. The joint distribution of the rows, 
conditional on being all distinct, is invariant under permutations. 

After the first inverse shuffle, all cards associated to binary vectors starting with 0 are above 
cards with binary vectors starting with 1. After two shuffles, cards associated with binary vectors 
starting (0,0) are on top followed by cards associated to vectors beginning (1,0), followed by 
(0, 1), followed by (1, 1) at the bottom of the deck. 

Inductively, the inverse shuffles sort the binary vectors (from right to left) in lexicographic 
order. At time T the vectors are all distinct, and all sorted. By permutation invariance, any of the
n cards is equally likely to have been associated with the smallest row of the matrix (and so be on
top). Similarly, at time T, all $n!$ orders are equally likely. □
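The sorting claim in the proof can be checked on the six-card matrix shown earlier. The sketch below applies the recorded inverse shuffles one column at a time and compares the result with a direct sort of the rows read from right to left. The function names are illustrative, and rows are stored as strings of 0s and 1s for convenience.

```python
def inverse_shuffle_by_rows(cards, rows):
    """Apply the inverse shuffles recorded in a binary matrix.

    rows[card] is that card's row; its j th character is the label the card
    received on shuffle j.
    """
    deck = list(cards)
    k = len(next(iter(rows.values())))
    for j in range(k):
        zeros = [c for c in deck if rows[c][j] == "0"]
        ones = [c for c in deck if rows[c][j] == "1"]
        deck = zeros + ones                  # zeros on top, order kept
    return deck

def sort_by_reversed_rows(cards, rows):
    """Sort cards by their rows read right to left (stable for equal rows)."""
    return sorted(cards, key=lambda c: rows[c][::-1])

rows = {"a": "1101", "b": "1100", "c": "0010",
        "d": "1011", "e": "0110", "f": "1000"}
final_deck = inverse_shuffle_by_rows("abcdef", rows)
print(final_deck, sort_by_reversed_rows("abcdef", rows))
```

Both computations give the order f, b, c, e, a, d, the last column of the matrix figure.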

To complete this analysis, the chance that T > k must be computed. This is one minus the
probability that if n balls are dropped into $2^k$ boxes no box receives two or more balls. If
the balls are thought of as people, and the boxes as birthdays, we have the familiar question of the
birthday problem and its well-known answer. This yields:


Theorem 5. For Q the Gilbert-Shannon-Reeds distribution defined in Lemma 7,

(4.1)  $$\|Q^{*k} - U\| \le P(T > k) = 1 - \prod_{i=1}^{n-1}\Bigl(1 - \frac{i}{2^k}\Bigr).$$

Standard calculus shows that if $k = 2\log_2(n/c)$,

$$P(T > k) \to 1 - e^{-c^2/2} \quad \text{as } n \to \infty.$$

In this sense, $2\log_2 n$ is the cut off point for this bound. Exact computation of the right side of
(4.1) when n = 52 gives the bounds


    k              10     11     12     13     14
    upper bound    .73    .48    .28    .15    .08
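The tabulated values follow from a direct evaluation of the product in (4.1). A minimal sketch, with an illustrative function name:

```python
def birthday_bound(n, k):
    """Right side of (4.1): chance that n balls in 2^k boxes share a box."""
    boxes = 2 ** k
    no_collision = 1.0
    for i in range(1, n):
        no_collision *= 1 - i / boxes
    return 1 - no_collision

table = {k: round(birthday_bound(52, k), 2) for k in range(10, 15)}
print(table)
```

Rounded to two decimals, this reproduces the five entries for n = 52.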


Remark (a). The lovely new idea here is to consider shuffling as inverse sorting. The argument 
works for any symmetric method of labelling the cards. For example, biased cuts can be modeled 
by flipping an unfair coin. To model cutting off exactly j cards each time, fill the columns of the 
matrix with the results of n draws without replacement from an urn containing j balls labelled 
zero and n - j balls labelled one. These lead to slightly unorthodox birthday problems which turn 
out to be easy to work with. 

Remark (b). The argument can be refined. Suppose shuffling is stopped slightly before all 
rows of the matrix are distinct (e.g., stop after 2 log n shuffles). Cards associated to identical 
binary rows correspond to cards in their original relative positions. It is possible to bound how far 
such permutations are from uniform and get bounds on $\|Q^{k*} - U\|$. Reeds (1981) has used such 
arguments to show that 9 or fewer shuffles make the variation distance small for 52 cards. 

Remark (c). A variety of ad hoc techniques have been used to get lower bounds. One simple 
method that works well is to simply follow the top card after repeated shuffles. This executes a 
Markov chain on n states with a simple transition matrix. For n in the range of real deck sizes, 
n × n matrices can be numerically multiplied and then the variation distance to uniform 
computed. Reeds (1981) has carried this out for decks of size 52 and shown that $\|Q^{6*} - U\| \ge .1$. 
Techniques which allow asymptotic verification that $k = (3/2) \log_2 n$ is the right cutoff for large n 
are described in Aldous (1983a). These analyses, and the results quoted above, suggest that seven 
riffle shuffles are needed to get close to random. 

Remark (d). Other mathematical models for riffle shuffling are suggested in Donner and 
Uppulini (1970), Epstein (1977), and Thorp (1973). Borel and Cheron (1955) and Kosambi and 
Rao (1958) discuss the problem in a less formal way. Where conclusions are drawn, 6 to 7 shuffles 
are recommended to randomize 52 cards. 

Remark (e). Of course, our ability to shuffle cards depends on practice and agility. The model 
produces shuffles with single cards being dropped about 1/2 of the time, pairs of cards being 
dropped about 1/4 of the time, and blocks of i cards being dropped about $1/2^i$ of the time. 
Professional dealers drop single cards 80% of the time, pairs about 18% of the time and hardly 
ever drop 3 or more cards. Less sophisticated card handlers drop single cards about 60% of the 
time. Further discussion is in Diaconis (1982) or Epstein (1977). 

It is not clear if neater shuffling makes for a better randomization mechanism. After all, eight 
perfect shuffles bring a deck back to order. Diaconis, Kantor, and Graham (1983) contains an 
extensive discussion of the mathematics of perfect shuffles, giving history and applications to 
gambling, computer science and group theory. 

The shuffle analyzed above is the most random among all single shuffles with a given 
distribution of cut size, being uniform among the possible outcomes. It may therefore serve as a 
lower bound; any less uniform shuffle might take at least as long to randomize things. Further 
discussion is in Mellish (1973). 

Remark (f). One may ask, “Does it matter?” It seems to many people that if a deck of cards is 
shuffled 3 or 4 times, it will be quite mixed up for practical purposes with none of the esoteric 
patterns involved in the above analysis coming in. 

Magicians and card cheats have long taken advantage of such patterns. Suppose a deck of 52 
cards in known order is shuffled 3 times and cut arbitrarily in between these shuffles. Then a card 
is taken out, noted and replaced in a different position. The noted card can be determined with 
near certainty! Gardner (1977) describes card tricks based on the inefficiency of too few riffle 
shuffles. 

Berger (1973) describes a different appearance of pattern. He compared the distribution of 
hands at tournament bridge before and after computers were used to randomize the order of the 
deck. The earlier, hand shuffled, distribution showed noticeable patterns (the suit distributions 
were too near “even” 4333) that a knowledgeable expert could use. 

It is worth noting that it is not totally trivial to shuffle cards on a computer. The usual method, 
described in Knuth (1981), goes as follows. Imagine the n cards in a row. At stage i, pick a 
random position between i and n (inclusive) and switch the card at the chosen position with the card at 
position i. Carried out for $1 \le i \le n - 1$, this results in a uniform permutation. In the early days 
of computer randomization, we are told that Bridge Clubs randomized by choosing about 60 
random transpositions (as opposed to 51 carefully randomized transpositions). As the analysis of 
Diaconis and Shahshahani (1981) shows, 60 is not enough. 
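
The n - 1 carefully chosen transpositions can be sketched in a few lines of Python. This is a minimal illustration of the method just described; the function name `uniform_shuffle` and the optional `rng` argument are choices made here. It uses 0-based positions, so stage i of the text corresponds to index i - 1 in the code.

```python
import random


def uniform_shuffle(cards, rng=random):
    """Return a uniformly random ordering of `cards` using n - 1 swaps."""
    deck = list(cards)
    n = len(deck)
    for i in range(n - 1):
        # Uniform position among i, i + 1, ..., n - 1; choosing j == i
        # (leaving the card where it is) must be allowed.
        j = rng.randrange(i, n)
        deck[i], deck[j] = deck[j], deck[i]
    return deck


if __name__ == "__main__":
    print(uniform_shuffle(range(1, 53), random.Random(2024))[:5])
```

The essential point is that the random position at each stage is drawn only from the cards not yet fixed; drawing it from the whole deck at every stage would no longer give a uniform permutation.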

Remark (g). While revising this paper we noted the following question and answer in a 
newspaper bridge column (“The Aces”, by Bobby Wolff). 

Q: How many times should a deck be shuffled before it is dealt? My fellow players insist on at 
least seven or eight shuffles. Isn’t this overdoing it? 

A: The laws stipulate that the deck must be “thoroughly shuffled”. While no specific number 
is stated, I would guess that five or six shuffles would be about right; seven or eight would 
not be out of order. 

5. Other Techniques and Open Problems. A number of other natural random walks admit 
elegant analyses with strong uniform times. For example, Andre Broder (1985) has given stopping 
times for simple random walk on the “cube” $Z_2^d$, and for the problem of randomizing n cards by 
random transpositions. We can similarly analyze nearest neighbor random walks on a variety of 
2-point homogeneous spaces. It is natural to inquire if a suitable stopping time can always be found. 
This problem is analyzed in Aldous and Diaconis (1985): let us merely state two results. 

We need to introduce a second notion of distance to the uniform distribution. Let Q be a 
probability on a finite group G. The separation of $Q^{k*}$ from the uniform distribution U after k steps 
is defined as 

    $s(k) = |G| \max_{g} \{U(g) - Q^{k*}(g)\}.$

Clearly $0 \le s(k) \le 1$, with $s(k) = 0$ if and only if $Q^{k*} = U$. The separation is an upper bound 
for the variation distance: 

    $d(k) \le s(k)$

because 

    $\|Q^{k*} - U\| = \sum_{g:\, Q^{k*}(g) < U(g)} \{U(g) - Q^{k*}(g)\},$

and each of the at most |G| summands is at most $s(k)/|G|$. 
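
Both distances can be computed exactly for a very small deck. The Python sketch below is an illustration under stated assumptions: it uses the top in at random shuffle of Example 1 on n = 4 cards, and the function names `top_in_at_random_step` and `distances` are hypothetical. It tracks the exact distribution after each shuffle and evaluates d(k) and s(k) from the definitions above.

```python
from fractions import Fraction
from itertools import permutations


def top_in_at_random_step(dist, n):
    """One shuffle: remove the top card, reinsert it at a uniform position."""
    new = {}
    for deck, p in dist.items():
        top, rest = deck[0], deck[1:]
        for pos in range(n):
            moved = rest[:pos] + (top,) + rest[pos:]
            new[moved] = new.get(moved, Fraction(0)) + p / n
    return new


def distances(dist, n):
    """Return (variation distance, separation) from uniform on S_n."""
    group = list(permutations(range(n)))
    u = Fraction(1, len(group))
    gaps = [u - dist.get(g, Fraction(0)) for g in group]
    d = sum(gap for gap in gaps if gap > 0)   # sum over g with Q(g) < U(g)
    s = len(group) * max(gaps)                # |G| max_g {U(g) - Q(g)}
    return d, s


if __name__ == "__main__":
    n = 4
    dist = {tuple(range(n)): Fraction(1)}     # start from a known order
    for k in range(1, 13):
        dist = top_in_at_random_step(dist, n)
        d, s = distances(dist, n)
        print(k, float(d), float(s))
```

In every row of the output the first number should not exceed the second, in line with the inequality between d(k) and s(k).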

The following result generalizes Lemma 1. 

Theorem 6. If T is a strong uniform time for the random walk generated by Q on G, then 
(5.1)    $s(k) \le P(T > k)$; all $k \ge 0$. 

Conversely, for every random walk there exists a randomized strong uniform time T such that 
(5.1) holds with equality. 

While separation and variation distance can differ, for random walk problems there is a sense 
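in which they only differ by a factor of 2. For $0 < \varepsilon < 1/4$ define 

    $\phi(\varepsilon) = 1 - (1 - 2\varepsilon^{1/2})(1 - \varepsilon^{1/2})^2$

and observe that $\phi(\varepsilon)$ decreases as $\varepsilon$ decreases, and $\phi(\varepsilon) \sim 4\varepsilon^{1/2}$ as $\varepsilon \to 0$. 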
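
Both observations about $\phi$ follow from a direct expansion. The short derivation below is added for the reader's convenience; the auxiliary symbol $a = \varepsilon^{1/2}$ is introduced here and does not appear in the original.

```latex
\begin{align*}
\phi(\varepsilon)
  &= 1 - (1 - 2a)(1 - a)^2, \qquad a = \varepsilon^{1/2}, \\
  &= 1 - (1 - 2a)(1 - 2a + a^2) \\
  &= 1 - (1 - 4a + 5a^2 - 2a^3) \\
  &= 4a - 5a^2 + 2a^3 .
\end{align*}
% Derivative in a: 4 - 10a + 6a^2 = 2(3a - 2)(a - 1) > 0 for 0 < a < 2/3,
% so \phi is increasing in \varepsilon on the range where 2a < 1.
% As a -> 0 the leading term is 4a = 4 \varepsilon^{1/2}.
```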

Theorem 7. For any distribution Q on any finite group G, 

    $s(2k) \le \phi(2d(k))$; $k \ge 1$, provided $d(k) < 1/8$. 

Thus, if k steps suffice to make the variation distance small, at most 2k steps are required to 
make the separation small. 

Coupling is a probabilistic technique closely related to strong uniform times which achieves the 
exact variation distance. The coupling technique applies to Markov chains far more general than 
random walks on groups: see Griffeath (1975, 1978), Pitman (1976), Athreya and Ney (1977). 

Random walk involves repeated convolution and it is natural to try to use Fourier analysis or 
its non-commutative analog, group representation. Such techniques can sometimes give very sharp 
bounds. Letac (1981) and Takacs (1982) are readable surveys. Diaconis and Shahshahani (1981, 
1984) present further examples. Robbins and Bolker (1981) use other techniques. 

Despite this range of available techniques, there are some shuffling methods for which we do 
not have good results on how many shuffles are needed; for example: 

(i) Riffle shuffles where there is a tendency for successive cards to be dropped from opposite 
hands. 

(ii) Overhand shuffle. The deck is divided into K blocks in some random way, and the order 
of the blocks is reversed. 

(iii) Semi-random transposition. At the kth shuffle, transpose the kth card (counting modulo 
n) with a uniform random card. 

From a theoretical viewpoint, there are interesting questions concerning the cut-off phenomenon. 
This occurs in all the examples we can explicitly calculate, but we know no general result which 
says that the phenomenon must happen for all “reasonable” shuffling methods. 

Acknowledgment. We thank Brad Efron, Leo Flatto, and Larry Shepp for help with Example 
1, and Jim Reeds for help with Section 4. 

WHAT IS A DIFFERENTIAL? 

A NEW ANSWER FROM THE GENERALIZED RIEMANN INTEGRAL 
Unlike derivatives, which gained a solid basis in Cauchy’s theory of limits, differentials found 
no effective accommodation with the rising level of rigor in calculus. Justly castigated by Berkeley 
as “ghosts of departed quantities”, differentials clung fortuitously to the notational niche in 
calculus created for them by Leibniz. In this century they came to be presented as functionals on 
tangent spaces, a constricted role that made them respectable but evaded the issue of their wider 
role in integration. The resurrection of infinitesimals by nonstandard analysis rekindled interest in 
Leibniz’ original concept of differential. 

We present here a completely new approach to differentials in one dimension. This approach is 
motivated by the following considerations: (i) differentials spring directly from the integration 
process, (ii) the utility of differentials in integration extends beyond conventional differential 
forms, (iii) a viable theory of differentials is readily attainable by standard analysis, and (iv) the 
generalized Riemann integral fills a vital gap in analysis and should have an innovative impact on 
our calculus and real variables courses. In the theory expounded here differentials on a 1-cell K 
form a Riesz space (lattice-ordered linear space). So for each differential $\sigma$ we have the 
differentials 

    $|\sigma| = \sigma \vee -\sigma$, $\sigma^+ = \sigma \vee 0$, and $\sigma^- = (-\sigma)^+ = -(\sigma \wedge 0)$


Solomon Leader : I wrote my Ph.D. thesis in analysis at Princeton in 1952 under the late Salomon Bochner. For 
the past 33 years I have been at Rutgers figuring out how calculus should be taught. My main interests have been in 
measure theory, integration, proximity spaces, and fixed points. In warm weather my favorite diversion is 
body-surfing off Long Beach Island. My wife and I enjoy snorkeling in the Virgin Islands and welcome any excuse 
to visit Switzerland. 



